feat(source/cloud-storage): add Cloud Storage source with list_objects and read_object tools #3081

Open
huangjiahua wants to merge 6 commits into googleapis:main from huangjiahua:feat/cloud-storage-source

Conversation

@huangjiahua

@huangjiahua huangjiahua commented Apr 16, 2026

Description

Adds Google Cloud Storage as a first-class source in MCP Toolbox, enabling LLM agents to work with objects across buckets in a GCP project. The source is project-scoped and authenticates via Application Default Credentials, mirroring Firestore/Bigtable.

This first PR ships the source plus two of the 14 tools in the approved design, both read-only:

  • cloud-storage-list-objects: prefix filter, delimiter-based grouping (returns prefixes), and pagination via max_results / page_token. Passes through whatever metadata the GCS client returns (*storage.ObjectAttrs), so new fields need no extra plumbing later.
  • cloud-storage-read-object: reads an object's content (textual data only), with optional HTTP-style byte ranges (bytes=0-999, bytes=-500, bytes=500-).
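To make the three range forms above concrete, here is a minimal sketch of a range parser. The name `parseRange` and the returned (offset, length) convention are assumptions for illustration, not the PR's actual code: a length of -1 means "to the end of the object" and a negative offset means "the last N bytes", which is the kind of pair a ranged reader usually expects.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseRange (hypothetical helper) turns a single HTTP-style byte range into
// an (offset, length) pair. length == -1 means "read to the end"; a negative
// offset means "read the last |offset| bytes".
func parseRange(spec string) (int64, int64, error) {
	if !strings.HasPrefix(spec, "bytes=") {
		return 0, 0, errors.New(`range must start with "bytes="`)
	}
	first, last, found := strings.Cut(strings.TrimPrefix(spec, "bytes="), "-")
	if !found || (first == "" && last == "") {
		return 0, 0, fmt.Errorf("malformed range %q", spec)
	}
	if first == "" { // suffix form: bytes=-N
		n, err := strconv.ParseInt(last, 10, 64)
		if err != nil || n <= 0 {
			return 0, 0, fmt.Errorf("malformed suffix range %q", spec)
		}
		return -n, -1, nil
	}
	start, err := strconv.ParseInt(first, 10, 64)
	if err != nil || start < 0 {
		return 0, 0, fmt.Errorf("malformed range start in %q", spec)
	}
	if last == "" { // open-ended form: bytes=N-
		return start, -1, nil
	}
	end, err := strconv.ParseInt(last, 10, 64)
	if err != nil || end < start {
		return 0, 0, fmt.Errorf("malformed range end in %q", spec)
	}
	// Closed form: both bounds are inclusive, so the length is end-start+1.
	return start, end - start + 1, nil
}

func main() {
	for _, spec := range []string{"bytes=0-999", "bytes=-500", "bytes=500-", "bytes=9-1"} {
		offset, length, err := parseRange(spec)
		fmt.Println(spec, offset, length, err)
	}
}
```

Rejecting multi-range specs and inverted bounds up front keeps the failure an agent-fixable input error rather than a round trip to the backend.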

GCS-aware error categorization (per DEVELOPER.md) is implemented in a new cloudstoragecommon helper that maps GCS sentinels and *googleapi.Error codes to Agent errors (missing bucket/object, bad request, unsatisfiable range) vs. Server errors (auth, IAM denial, quota, 5xx, context cancellation). This replaces the coarse util.ProcessGcpError for the two new tools.
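The Agent/Server split described above could be sketched roughly as follows. This is a standard-library-only illustration: `apiError` stands in for `*googleapi.Error`, the two sentinels stand in for the GCS client's "not exist" errors, and treating unrecognized errors as Server errors is an assumption rather than something the PR states.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

type category string

const (
	agentError  category = "agent"  // the caller can fix the request and retry
	serverError category = "server" // the caller cannot fix this by changing input
)

// Hypothetical stand-ins for the GCS client's sentinel errors.
var (
	errBucketNotExist = errors.New("bucket doesn't exist")
	errObjectNotExist = errors.New("object doesn't exist")
)

// apiError is a hypothetical stand-in for an HTTP-status-carrying API error.
type apiError struct {
	Code    int
	Message string
}

func (e *apiError) Error() string {
	return fmt.Sprintf("api error %d: %s", e.Code, e.Message)
}

// classify maps an error to a category, looking through wrapped errors.
func classify(err error) category {
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		return serverError
	}
	if errors.Is(err, errBucketNotExist) || errors.Is(err, errObjectNotExist) {
		return agentError
	}
	var apiErr *apiError
	if errors.As(err, &apiErr) {
		switch apiErr.Code {
		case http.StatusBadRequest, http.StatusNotFound,
			http.StatusRequestedRangeNotSatisfiable:
			return agentError
		}
		return serverError // auth (401), IAM denial (403), quota (429), 5xx
	}
	return serverError // assumed fallback for unrecognized errors
}

func main() {
	wrapped := fmt.Errorf("read failed: %w", &apiError{Code: 416, Message: "bad range"})
	fmt.Println(classify(wrapped))
	fmt.Println(classify(&apiError{Code: 403, Message: "denied"}))
	fmt.Println(classify(context.DeadlineExceeded))
}
```

Using `errors.Is` and `errors.As` rather than direct comparison means the classification survives any wrapping added by intermediate layers.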

Remaining 12 tools from the design doc (list_buckets, create_bucket, copy/move/delete_object, etc.) will land in follow-up PRs.

CI note: the cloud-storage shard in .ci/integration.cloudbuild.yaml expects CLOUD_STORAGE_PROJECT=$PROJECT_ID and requires the test service account to have a Cloud Storage admin role in the test project. The integration test creates its own UUID-suffixed bucket and removes it with defer-based cleanup.

PR Checklist

  • Make sure you reviewed CONTRIBUTING.md
  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea (communicated internally)
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)
  • Make sure to add ! if this involves a breaking change

What's included

  • New source: internal/sources/cloudstorage/ (+ YAML-parse unit tests)
  • Two tools: internal/tools/cloudstorage/cloudstoragelistobjects/, .../cloudstoragereadobject/ (+ YAML-parse + range-parser unit tests)
  • New cloudstoragecommon error classifier (+ 17-case unit test covering sentinels, HTTP statuses, context.Canceled/DeadlineExceeded, and fallback)
  • Integration test: tests/cloudstorage/cloud_storage_integration_test.go — 12 sub-tests against a real bucket (self-created, self-cleaned)
  • Docs: docs/en/integrations/cloud-storage/ (source + both tool pages; passes .ci/lint-docs-{source,tool}-page.sh)
  • CI shard: cloud-storage in .ci/integration.cloudbuild.yaml
  • Dependency: cloud.google.com/go/storage v1.62.1

Opening as draft for initial review — happy to split the error-classifier refactor into a separate commit if reviewers prefer.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds Google Cloud Storage integration, introducing a new source and tools for listing and reading objects. The implementation includes configuration, error handling, and tests. Feedback recommends capping listing page sizes at 1000 for consistency, implementing memory safety limits when reading objects, and updating documentation titles to include the 'Tool' suffix.

Comment thread internal/sources/cloudstorage/cloudstorage.go
Comment thread internal/sources/cloudstorage/cloudstorage.go
@huangjiahua huangjiahua marked this pull request as ready for review April 16, 2026 23:28
@huangjiahua huangjiahua requested a review from a team as a code owner April 16, 2026 23:28
…s and read_object tools

Adds a new project-scoped `cloud-storage` source using ADC, plus two read-only
tools: `cloud-storage-list-objects` (with prefix/delimiter/pagination) and
`cloud-storage-read-object` (with HTTP-style byte range and base64 payload).

Introduces a GCS-aware error classifier in `cloudstoragecommon` that splits
failures into Agent errors (missing bucket/object, bad request, unsatisfiable
range) and Server errors (auth, IAM denial, quota, 5xx, cancellation) per
DEVELOPER.md, replacing the coarse-grained `util.ProcessGcpError`.

Ships YAML-parse unit tests, an error-classifier unit test, a range-parser unit
test, a live-GCS integration test (12 sub-tests, UUID-suffixed bucket with
self-cleanup), docs under `docs/en/integrations/cloud-storage/`, and a
`cloud-storage` CI shard.

The remaining 12 tools from the approved design doc land in follow-up PRs.
…dObject at 1 MiB

- ListObjects: pageSize() now clamps to the GCS API max of 1000 so callers that
  pass a larger max_results don't pre-allocate oversized buffers.
- ReadObject: reject objects/ranges over 1 MiB with the new sentinel
  cloudstoragecommon.ErrReadSizeLimitExceeded, which the classifier maps to an
  Agent error so the LLM can retry with a narrower 'range'.
- Docs + integration tests updated (two new sub-tests: oversize rejection and
  oversize-narrowed-by-range success).
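The clamping behavior in the ListObjects bullet can be sketched in a few lines. The function name `pageSize` comes from the commit message, but the signature, the constant name, and the handling of zero or negative inputs (falling back to the cap) are assumptions made for illustration.

```go
package main

import "fmt"

// maxPageSize mirrors the API maximum of 1000 mentioned in the commit message.
const maxPageSize = 1000

// pageSize clamps a caller-supplied max_results into (0, maxPageSize].
// Assumption: zero or negative input means "no preference" and yields the cap.
func pageSize(maxResults int) int {
	if maxResults <= 0 || maxResults > maxPageSize {
		return maxPageSize
	}
	return maxResults
}

func main() {
	for _, in := range []int{0, -5, 50, 1000, 5000} {
		fmt.Printf("max_results=%d -> page size %d\n", in, pageSize(in))
	}
}
```

Clamping before any buffer is sized from the value is what prevents an oversized max_results from driving an oversized allocation.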
… MiB

8 MiB gives agents more headroom for typical text/JSON/log payloads while
still guarding against OOM. Doc and the oversize integration seed updated to
match.
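One way to enforce a read cap like the 8 MiB limit above is to read at most one byte past the limit and fail if that extra byte arrives. This is a sketch under assumptions: `readCapped`, `maxReadBytes`, and `fakeObject` are hypothetical names, and the PR may instead (or additionally) check the object's reported size before reading.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// maxReadBytes is the 8 MiB cap from the commit message (name is hypothetical).
const maxReadBytes = 8 << 20

// errReadSizeLimitExceeded stands in for the sentinel named in the PR.
var errReadSizeLimitExceeded = errors.New("read size limit exceeded")

// readCapped reads at most limit bytes from r. It asks for limit+1 bytes so
// that an oversize payload is detected without buffering all of it.
func readCapped(r io.Reader, limit int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errReadSizeLimitExceeded
	}
	return data, nil
}

// fakeObject simulates an object of a given size without allocating it.
type fakeObject struct{ remaining int }

func (f *fakeObject) Read(p []byte) (int, error) {
	if f.remaining == 0 {
		return 0, io.EOF
	}
	n := len(p)
	if n > f.remaining {
		n = f.remaining
	}
	for i := 0; i < n; i++ {
		p[i] = 'a'
	}
	f.remaining -= n
	return n, nil
}

func main() {
	_, err := readCapped(&fakeObject{remaining: maxReadBytes + 1}, maxReadBytes)
	fmt.Println("oversize:", err)
	data, err := readCapped(&fakeObject{remaining: 1024}, maxReadBytes)
	fmt.Println("small:", len(data), err)
}
```

Because the sentinel is a distinct error value, a classifier can recognize it and report it as something the agent can fix by narrowing the range.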
…ckage

DefaultMaxReadBytes doesn't belong in errors.go — the limit is a source-side
invariant, not an error-classification concern. The sentinel
ErrReadSizeLimitExceeded stays in cloudstoragecommon because the classifier
still needs to recognize it.
…geSize bounds

Cleanup loop in the integration test was treating any iterator error as
iterator.Done; now distinguishes the two and logs non-Done errors so
flaky teardowns are debuggable. Also adds an internal unit test for
pageSize covering 0, negative, in-range, and over-cap inputs.
MCP tool results only carry text today, so the previous base64-encoded
content was unusable by the LLM. Validate object bytes with utf8.Valid
and return plain-text content; non-UTF-8 objects surface as an
agent-fixable ErrBinaryContent error. TODO notes mark the spots to
revisit once MCP supports embedded resources.
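The UTF-8 check described in this commit can be sketched as below. `utf8.Valid` is the standard-library call the commit names; the helper name `textContent` and the local sentinel are hypothetical stand-ins for the PR's own identifiers.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errBinaryContent stands in for the ErrBinaryContent sentinel named above.
var errBinaryContent = errors.New("object content is not valid UTF-8 text")

// textContent returns the bytes as a string only if they are valid UTF-8;
// anything else is reported as binary content.
func textContent(data []byte) (string, error) {
	if !utf8.Valid(data) {
		return "", errBinaryContent
	}
	return string(data), nil
}

func main() {
	text, err := textContent([]byte(`{"status":"ok"}`))
	fmt.Println(text, err)

	_, err = textContent([]byte{0xff, 0xfe, 0x00})
	fmt.Println(err)
}
```

Note that a byte range can split a multi-byte character, so a ranged read of an otherwise valid text object may still fail this check at the boundary.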
@huangjiahua huangjiahua force-pushed the feat/cloud-storage-source branch from 91a222a to 4919821 Compare April 17, 2026 19:30
2 participants